Portland housing data - exploratory data analysis¶

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
# import seaborn as sns
%matplotlib inline

from arcgis.gis import GIS
# from arcgis.features import GeoAccessor, GeoSeriesAccessor
gis = GIS()

Read data¶

In [8]:
csv_path = '/Users/atma6951/Documents/GIS_data/Analytics/PDX/houses_for_sale_filtered.csv'
prop_df = pd.read_csv(csv_path)
prop_df.head(3)
Out[8]:
Unnamed: 0 Unnamed: 0.1 SALE TYPE PROPERTY TYPE ADDRESS CITY STATE ZIP PRICE BEDS ... YEAR BUILT DAYS ON MARKET PRICE PER SQFT HOA PER MONTH STATUS URL SOURCE MLS LATITUDE LONGITUDE
0 0 0 MLS Listing Single Family Residential 3445 NE Marine Dr Unit BH04 Portland OR 97211.0 27500.0 0.0 ... NaN 53.0 79.0 NaN Active http://www.redfin.com/OR/Portland/3445-NE-Mari... RMLS 18567126 45.600583 -122.628508
1 3 3 MLS Listing Mobile/Manufactured Home 6112 SE Clatsop St Portland OR 97206.0 35000.0 2.0 ... 1981.0 93.0 44.0 NaN Active http://www.redfin.com/OR/Portland/6112-SE-Clat... RMLS 18331169 45.461282 -122.600153
2 8 8 For-Sale-by-Owner Listing Single Family Residential 6901 SE Oaks Park Way Slip 32 Portland OR 97202.0 60000.0 1.0 ... 1939.0 359.0 55.0 NaN Active http://www.redfin.com/OR/Portland/6901-SE-Oaks... Fizber.com 4933081 45.474369 -122.662307

3 rows × 24 columns

In [9]:
prop_df.shape
Out[9]:
(3894, 24)
In [10]:
prop_df.columns
Out[10]:
Index(['Unnamed: 0', 'Unnamed: 0.1', 'SALE TYPE', 'PROPERTY TYPE', 'ADDRESS',
       'CITY', 'STATE', 'ZIP', 'PRICE', 'BEDS', 'BATHS', 'LOCATION',
       'SQUARE FEET', 'LOT SIZE', 'YEAR BUILT', 'DAYS ON MARKET',
       'PRICE PER SQFT', 'HOA PER MONTH', 'STATUS', 'URL', 'SOURCE', 'MLS',
       'LATITUDE', 'LONGITUDE'],
      dtype='object')

Drop redundant columns

In [11]:
prop_df.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'], inplace=True)

Visualize spatially¶

Convert to Spatially Enabled DataFrame

In [12]:
prop_sdf = pd.DataFrame.spatial.from_xy(prop_df, 'LONGITUDE','LATITUDE')
In [37]:
pdx_map = gis.map('Portland, OR')
pdx_map.basemap = 'streets'
pdx_map

In [38]:
prop_sdf.spatial.plot(map_widget=pdx_map)
/Users/atma6951/anaconda3/envs/geosaurus_gold/lib/python3.6/site-packages/numpy/core/_methods.py:32: RuntimeWarning: invalid value encountered in reduce
  return umr_minimum(a, axis, None, out, keepdims, initial)
/Users/atma6951/anaconda3/envs/geosaurus_gold/lib/python3.6/site-packages/numpy/core/_methods.py:28: RuntimeWarning: invalid value encountered in reduce
  return umr_maximum(a, axis, None, out, keepdims, initial)
Out[38]:
True

Explore density of properties for sale¶

In [8]:
pdx_density_map = gis.map('Portland, OR')
pdx_density_map.basemap='gray'
pdx_density_map

In [12]:
prop_sdf.spatial.plot(map_widget=pdx_density_map, renderer_type='h')
/Users/atma6951/anaconda3/envs/geosaurus_v14/lib/python3.6/site-packages/numpy/core/_methods.py:32: RuntimeWarning: invalid value encountered in reduce
  return umr_minimum(a, axis, None, out, keepdims, initial)
/Users/atma6951/anaconda3/envs/geosaurus_v14/lib/python3.6/site-packages/numpy/core/_methods.py:28: RuntimeWarning: invalid value encountered in reduce
  return umr_maximum(a, axis, None, out, keepdims, initial)
Out[12]:
True

Explore distribution of numeric columns:¶

In [18]:
ax_list = prop_sdf.hist(bins=25, layout=(3,4), figsize=(15,12))

Explore the frequency of categorical columns

In [20]:
fig2, ax2 = plt.subplots(1,2, figsize=(10,5))

prop_sdf['CITY'].value_counts().plot(kind='bar', ax=ax2[0], 
                                             title='City name frequency')
ax2[0].tick_params(labelrotation=45)

prop_sdf['PROPERTY TYPE'].value_counts().plot(kind='bar', ax=ax2[1], 
                                             title='Property type frequency')
ax2[1].tick_params(labelrotation=45)
plt.tight_layout()

Visualize geographic distributions of housing properties¶

  • class color renderer of property age
  • class color of price
  • class color of size
  • hoa vs no hoa

Visualize spatial distribution by property age¶

In [21]:
pdx_age_map = gis.map("Portland, OR")
pdx_age_map.basemap = 'gray-vector'
pdx_age_map

In [22]:
prop_sdf.spatial.plot(map_widget = pdx_age_map, 
                      renderer_type='c', # for classs breaks renderer
                     method='esriClassifyNaturalBreaks',  # classification scheme
                     class_count=10,  # between 1900 - 2000, each decade in a class
                     col='YEAR BUILT',
                     cmap='Blues',  # matplotlib color map
                     alpha=0.7,
                     outline_color=[0,0,0,0])
Out[22]:
True
In [23]:
age_class_breaks = pdx_age_map.layers[0].layer.layerDefinition.drawingInfo.renderer.classBreakInfos
# print(len(age_class_breaks))
cbs_list = []
cmap_list = []
for cb in age_class_breaks:
#     print(cb.description)  # print the class break labels
    cbs_list.append(cb.classMaxValue)
    cmap_list.append([x/255.0 for x in cb.symbol.color])
    
# build a histogram for the same class breaks
n, bins, patches = plt.hist(prop_sdf['YEAR BUILT'], bins=cbs_list)

# apply the same color for each class to match the map
idx = 0
for c, p in zip(bins, patches):
    plt.setp(p, 'facecolor', cmap_list[idx])
    idx+=1

plt.title('Histogram of YEAR BUILT column')
Out[23]:
Text(0.5,1,'Histogram of YEAR BUILT column')

Visualize spatial distribution by price¶

In [24]:
pdx_price_map = gis.map("Portland, OR")
pdx_price_map.basemap = 'gray-vector'
pdx_price_map

In [25]:
prop_sdf.spatial.plot(map_widget = pdx_price_map, 
                      renderer_type='c', # for classs breaks renderer
                     method='esriClassifyNaturalBreaks',  # classification scheme
                     class_count=10,  # between 1900 - 2000, each decade in a class
                     col='PRICE',
                     cmap='BuPu_r',  # matplotlib color map
                     alpha=0.5,
                     outline_color=[50,0,0,50], line_width=1)
Out[25]:
True
In [26]:
price_class_breaks = pdx_price_map.layers[0].layer.layerDefinition.drawingInfo.renderer.classBreakInfos
# print(len(age_class_breaks))
cbs_list = []
cmap_list = []
for cb in price_class_breaks:
#     print(cb.description)  # print the class break labels
    cbs_list.append(cb.classMaxValue)
    cmap_list.append([x/255.0 for x in cb.symbol.color])
    
# build a histogram for the same class breaks
n, bins, patches = plt.hist(prop_sdf['PRICE'], bins=cbs_list)

# apply the same color for each class to match the map
idx = 0
for c, p in zip(bins, patches):
    plt.setp(p, 'facecolor', cmap_list[idx])
    idx+=1

plt.title('Histogram of PRICE column')
Out[26]:
Text(0.5,1,'Histogram of PRICE column')

Visualize spatial distribution of property size¶

In [27]:
pdx_size_map = gis.map("Portland, OR")
pdx_size_map.basemap = 'gray-vector'
pdx_size_map

In [28]:
prop_sdf.spatial.plot(map_widget = pdx_size_map, 
                      renderer_type='c', # for classs breaks renderer
                     method='esriClassifyNaturalBreaks',  # classification scheme
                     class_count=10,  # between 1900 - 2000, each decade in a class
                     col='SQUARE FEET',
                     cmap='RdBu',  # matplotlib color map
                     alpha=0.7,
                     outline_color=[50,0,0,50], line_width=1)
Out[28]:
True
In [29]:
size_class_breaks = pdx_size_map.layers[0].layer.layerDefinition.drawingInfo.renderer.classBreakInfos
# print(len(age_class_breaks))
cbs_list = []
cmap_list = []
for cb in size_class_breaks:
#     print(cb.description)  # print the class break labels
    cbs_list.append(cb.classMaxValue)
    cmap_list.append([x/255.0 for x in cb.symbol.color])
    
# build a histogram for the same class breaks
n, bins, patches = plt.hist(prop_sdf['SQUARE FEET'], bins=cbs_list)

# apply the same color for each class to match the map
idx = 0
for c, p in zip(bins, patches):
    plt.setp(p, 'facecolor', cmap_list[idx])
    idx+=1

plt.title('Histogram of SQUARE FEET column')
Out[29]:
Text(0.5,1,'Histogram of SQUARE FEET column')

Spatial distribution of HoA¶

In [30]:
pdx_hoa_map = gis.map("Portland, OR")
pdx_hoa_map.basemap = 'gray-vector'
pdx_hoa_map

In [40]:
prop_sdf_hoa_f = prop_sdf[prop_sdf['HOA PER MONTH']>=0]
prop_sdf_hoa_f.spatial.plot(map_widget = pdx_hoa_map, 
                      renderer_type='c', # for classs breaks renderer
                     method='esriClassifyQuantile',  # classification scheme
                     class_count=10,  # between 1900 - 2000, each decade in a class
                     col='HOA PER MONTH',
                     cmap='RdBu',  # matplotlib color map
                     alpha=0.7,
                     outline_color=[0,0,0,0], line_width=0)
/Users/atma6951/anaconda3/envs/geosaurus_gold/lib/python3.6/site-packages/arcgis/features/geo/_accessor.py:861: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._data[col] = self._data[col]
/Users/atma6951/anaconda3/envs/geosaurus_gold/lib/python3.6/site-packages/arcgis/features/geo/_accessor.py:1968: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._data['OBJECTID'] = list(range(1, self._data.shape[0] + 1))
Out[40]:
True
In [41]:
hoa_class_breaks = pdx_hoa_map.layers[0].layer.layerDefinition.drawingInfo.renderer.classBreakInfos
# print(len(age_class_breaks))
cbs_list = []
cmap_list = []
for cb in hoa_class_breaks:
#     print(cb.description)  # print the class break labels
    cbs_list.append(cb.classMaxValue)
    cmap_list.append([x/255.0 for x in cb.symbol.color])
    
# build a histogram for the same class breaks
n, bins, patches = plt.hist(prop_sdf['HOA PER MONTH'], bins=cbs_list)

# apply the same color for each class to match the map
idx = 0
for c, p in zip(bins, patches):
    plt.setp(p, 'facecolor', cmap_list[idx])
    idx+=1

plt.title('Histogram of HOA PER MONTH column')
Out[41]:
Text(0.5,1,'Histogram of HOA PER MONTH column')

Filter based on your criteria¶

In [48]:
filtered_df = prop_sdf[(prop_df['BEDS']>=2) & 
                       (prop_df['BATHS']>1)& 
                       (prop_df['HOA PER MONTH']<=200) & 
                       (prop_df['YEAR BUILT']>=2000) & 
                       (prop_df['SQUARE FEET'] > 2000) & 
                       (prop_df['PRICE']<=700000)]
filtered_df.shape
Out[48]:
(134, 23)
In [49]:
ax_list2 = filtered_df.hist(bins=25, layout=(4,4), figsize=(15,15))
In [44]:
pdx_filtered_map = gis.map("Portland, OR")
pdx_filtered_map.basemap = 'gray-vector'
pdx_filtered_map

In [45]:
filtered_df.spatial.plot(map_widget=pdx_filtered_map, 
                         renderer_type='c',
                         method='esriClassifyNaturalBreaks',  # classification scheme
                         class_count=10,
                         col='PRICE',
                         cmap='Blues',  # matplotlib color map
                        alpha=0.7,outline_color=[0,0,0,0])
/Users/atma6951/anaconda3/envs/geosaurus_gold/lib/python3.6/site-packages/arcgis/features/geo/_accessor.py:861: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._data[col] = self._data[col]
/Users/atma6951/anaconda3/envs/geosaurus_gold/lib/python3.6/site-packages/arcgis/features/geo/_accessor.py:1968: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._data['OBJECTID'] = list(range(1, self._data.shape[0] + 1))
Out[45]:
True

Write the shortlisted properties to disk¶

So far, we used attribute queries to explore and filter out properties. We have not yet used GIS analysis to narrow them further. Before that, let us save our work to disk.

In [50]:
filtered_df.to_csv('/Users/atma6951/Documents/GIS_data/Analytics/PDX/houses_for_sale_att_filtered.csv')